The Dunnhumby dataset consists of anonymized shopping data from 2,500 households, spanning over two years. The data is well formatted and requires very little processing. It contains generic information about each household and its composition, grocery consumption information, broad descriptions of the products, and details about the discount coupons that were redeemed.
In this analysis, we focus mainly on food products since we want to establish a link with the Open Food Facts dataset. We hence use only three files of Dunnhumby: hh_demographic.csv, transaction_data.csv and product.csv.
We start by studying each feature's statistics in order to identify the most relevant and important ones. We then merge the three files to obtain one dataframe containing all the significant features. Finally, we analyse the correlation matrices between attributes of the dataset in order to find potential links.
For easier readability:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
def plotDistributionDataset(df):
n_row, n_col = df.shape
columns = list(df.columns)
plt.figure(num = None, figsize = (5*3,5*np.ceil(n_col/3)))
for i in range(n_col):
plt.subplot(int(np.ceil(n_col/3)),3,i+1)
subdf = df.iloc[:,i]
if (not np.issubdtype(type(subdf.iloc[0]), np.number)):
valueCounts = subdf.value_counts()
valueCounts.sort_index(ascending=True).plot.bar()
else:
if(subdf.dtype == 'int64'):
subdf.hist(bins = subdf.max()-subdf.min())
else :
subdf.hist()
plt.ylabel('counts')
plt.xticks(rotation = 90)
plt.title(f'{columns[i]} (column {i})')
plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
plt.show()
def correct_income(x):
if(x == "Under 15K"):
c = "-15K"
elif(len(x) == 6):
c = '-'.join([x.split('-')[0].zfill(3),x.split('-')[1].zfill(4)])
else:
c = x
return c
def correct_marital_status(x):
if(x == "A"):
c = "Married"
elif(x == "B"):
c = "Single"
elif(x == "U"):
c = "Unknown"
return c
def replace_unknowns(hh_comp, marital_status, hh_size):
if(hh_comp == '2 Adults Kids' and marital_status == 'Unknown'):
marital_status = 'Married'
elif(hh_comp == '1 Adult Kids' and marital_status == 'Unknown'):
marital_status = 'Single'
elif((hh_comp == 'Single Male' or hh_comp == 'Single Female') and marital_status == 'Unknown'):
marital_status = 'Single'
elif(hh_size == '1' and marital_status == 'Unknown'):
marital_status = 'Single'
else:
marital_status = marital_status
return marital_status
def ploting_stacked_plot(count,elements,list_specifics,specific,text):
count = np.array(count)
c = pd.DataFrame(np.transpose(count),columns = tuple(list_specifics),index =tuple(set(elements)))
sns.set()
if len(elements) <= 12:
sns.set_palette("Paired", 12)
else:
sns.set_palette("hls", 20)
c.T.plot(kind = 'bar',stacked = True,figsize = (15,10))
plt.ylabel('Purchases Counts per Food')
plt.xlabel(specific)
plt.title('Amount of purchases of '+text+' foods along '+specific)
plt.show()
def ploting_specific_general(dataset, specific):
list_specifics = dataset.loc[:,specific].value_counts().sort_index(ascending=True).index
list_food = []
for i in range(len(list_specifics)):
test = dataset[dataset.loc[:,specific].apply(lambda x: x == list_specifics[i])]
list_food.append(list(test.SUB_COMMODITY_DESC.value_counts()[:10].index))
count = 0
elements3 = []
elements2 = []
for l in list_food:
elements1 = []
for element in l:
inter = list_food.copy()
inter.remove(l)
list_final = [x for xs in inter for x in xs]
if(element not in list_final or list_final.count(element) <int(0.50*len(list_specifics))):
elements1.append(element)
elements3.append(element)
else:
elements2.append(element)
count += 1
count1 = []
count2 = []
for list_specific in dataset.loc[:,specific].unique():
for food in set(elements2):
count2.append(int(dataset[(dataset.loc[:,specific] == list_specific) & (dataset.SUB_COMMODITY_DESC == food)].shape[0] / \
dataset[(dataset.loc[:,specific] == list_specific)].shape[0] * 100))
count1.append(count2)
count2 = []
ploting_stacked_plot(count1,elements2,list_specifics,specific,'most common')
count1 = []
count2 = []
for list_specific in dataset.loc[:,specific].unique():
for food in set(elements3):
count2.append(int(dataset[(dataset.loc[:,specific] == list_specific) & (dataset.SUB_COMMODITY_DESC == food)].shape[0] / \
dataset[(dataset.loc[:,specific] == list_specific)].shape[0]*100))
count1.append(count2)
count2 = []
elements3 = list(set(elements3))
count1_array = np.array(count1)
elements = []
count1 = np.array([])
index_col = []
for i in range(len(count1_array[0])):
if(np.count_nonzero(count1_array[:,i]) != 0 and np.count_nonzero(count1_array[:,i]) <= 2):
elements.append(elements3[i])
else:
index_col.append(i)
count1 = np.delete(count1_array,index_col, 1)
ploting_stacked_plot(count1,elements,list_specifics,specific,'specific')
In this part, we load the three subdatasets we decided to use and conduct a preliminary study on each of them.
Comments:
The hh_demographic subdataset contains demographic information on 801 households. For each one, it provides the age range, marital status, income range, type of home ownership, household composition, household size, and number of kids. This subdataset is not complete: the Dunnhumby study covered 2,500 households, while demographic information is available for only 801 of them.
demographic = pd.read_csv('dataset/hh_demographic.csv', delimiter=',')
print(f'The number of rows and columns in the dataframe is respectively {demographic.shape[0]} and {demographic.shape[1]}')
demographic.head()
Comments:
We can see that the demographic subdataset contains no NaN values, so there is no cleaning to do on this part.
demographic.isna().any()
Comments:
To get a better visualization in the plots below, we preprocess the class names of the Income feature so that they appear in ascending order, and we recode the marital status codes into readable labels.
demographic.INCOME_DESC = demographic.INCOME_DESC.apply(correct_income)
demographic.MARITAL_STATUS_CODE = demographic.MARITAL_STATUS_CODE.apply(correct_marital_status)
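As a quick check (using class labels mentioned elsewhere in this analysis), the zero-padding makes a plain lexicographic sort follow the numeric order:
print(correct_income('Under 15K'))  # -> '-15K' (sorts first)
print(correct_income('50-74K'))     # -> '050-074K'
print(correct_income('250K+'))      # -> '250K+' (unchanged, already sorts last)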
Comments:
We use our custom-made function in order to plot the distributions of each feature of the demographic dataset.
AGE: The age feature is made of 6 classes ranging from 19 years old to 65+. The most common class (the one with the maximum value count) is '45-54', with a value count of 288. It is also easily noticeable that the distribution between classes is not uniform.
Marital Status: This feature is made of 3 classes: 'Married', 'Single' and 'Unknown'. The most common classes are 'Married' and 'Unknown', with value counts of 340 and 344, respectively (the Unknown values will be addressed later). For this feature also, the distribution is not uniform between classes.
Income: Made of 12 classes, this feature ranges from an income of under 15K US dollars to 250K+. The most common class is '50-74K', with a value count of 192. For this feature also, the distribution is not uniform between classes.
Homeowner: This feature specifies the type of ownership of the household, i.e. whether the individuals living in the house are homeowners or renters. It is made of 5 classes: 'Homeowner', 'Probable Owner', 'Probable Renter', 'Renter' and 'Unknown'. The most common type of ownership is 'Homeowner'.
Household composition: This feature contains 6 classes detailing the composition of households. The most common class is '2 adults no kids'. Since this feature contains far fewer 'Unknown' values than 'Marital Status', and could provide information about the latter, we intend to use it to complete the missing values of marital status.
Household Size: This feature gives the number of individuals living in a household. It is made of 5 classes ranging from 1 to 5+ individuals. This attribute can also be used as an indicator for completing the marital status feature.
Kid Category: This attribute is the number of kids per household. It has 4 classes: '1' kid, '2' kids, '3+' kids, and 'Unknown'. The latter is the most common one.
Household Key: We notice that the dataset contains demographic information for only a portion of the households, meaning that demographics are not available for all of them.
plotDistributionDataset(demographic)
Comments:
As mentioned previously, we use Household Composition and Household Size to replace the unknown values of Marital Status. Since these pieces of information are conceptually linked, we can infer one from the other. We implemented a rule-based function called replace_unknowns() that assigns a class to the marital status depending on the values of household composition and size: we consider 2 adults with kids as married, 1 adult with kids as single, a single male/female as single, and a household of size 1 as single as well.
We can see that we have been able to reduce the number of Unknown values from 344 to 95. We won't drop the remaining Unknowns since we mainly intend to use the income, age and composition information, and we don't want to lose those rows.
demographic.MARITAL_STATUS_CODE = demographic.apply(lambda row: replace_unknowns(row['HH_COMP_DESC'],
row['MARITAL_STATUS_CODE'], row['HOUSEHOLD_SIZE_DESC']),axis=1)
demographic.MARITAL_STATUS_CODE.value_counts()
Comments:
Now we take a look at the dependencies between our features. We start by one-hot encoding the features since they are categorical, then plot a correlation heatmap to compare them. We observe the following:
one_hot_demographic = pd.get_dummies(demographic, columns=['AGE_DESC', 'MARITAL_STATUS_CODE', 'INCOME_DESC', 'HOMEOWNER_DESC',
'HH_COMP_DESC', 'HOUSEHOLD_SIZE_DESC', 'KID_CATEGORY_DESC'])
corr = one_hot_demographic.corr(method = 'pearson')
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(15,15)) # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
Comments:
The transaction data subdataset contains all the purchases made by households within the study. This table is transaction-centred: for each transaction it provides the key of the household that made it, the product ID, the quantity bought, the sales value (the margin gained by the retailer), as well as the transaction time, day and week.
transaction = pd.read_csv('dataset/transaction_data.csv', delimiter=',')
print(f'The number of rows and columns in the dataframe is respectively {transaction.shape[0]} and {transaction.shape[1]}')
transaction.head()
Comments:
As we saw in the demographic analysis, the first subdataset does not include all the household IDs. It is therefore pertinent to keep only the transactions made by households with provided demographic data, hence the following.
demographic_households = demographic.household_key.unique()
transaction = transaction[transaction.household_key.apply(lambda x: x in demographic_households)]
print(f'The number of rows and columns in the dataframe is now respectively {transaction.shape[0]} and {transaction.shape[1]}')
Comments:
This dataset details all the transactions made by each household. Therefore, it gives information on their consumption over time. Using the custom function to plot the distributions of each feature, we get the following plots and observations:
Week_no: This feature specifies the week during which the transaction was made. It ranges from 0 to 102. We can see that consumption starts low in the first weeks and increases rapidly to reach a plateau of approximately 28,000 transactions per week. We also notice that in the last week, consumption explodes to nearly 50,000 transactions. To keep the data uniform and have a stable consumption level over the weeks, we intend to drop the transactions made during the first and last weeks (a minimal sketch of this filtering appears after the plots below).
Transaction Time: This feature gives the time of day at which the transaction was made. It ranges from 0 to 2400, corresponding to the 24 hours of the day. We can see that most transactions are made during the day between 10:00 and 22:00, with a consumption peak at 18:00.
Day: This feature specifies the day on which the transaction was made; it simply gives a more precise date than the Week_no attribute. As with the latter, we can see low consumption during the first days, increasing to reach a plateau of approximately 4,000 transactions per day. Since this feature has a higher time resolution, we will drop the transactions made in the first and last few days to make the data more uniform over time.
The custom function does not give correct plots for the features QUANTITY and SALES_VALUE because the distributions of these attributes are extremely skewed, which is why we plot them separately. We observe the following:
Quantity: This feature gives the quantity of the product bought in each transaction. It is heavy-tailed, with values that reach more than 89,000; however, most values are between 0 and 10.
Sales value: This attribute represents the margin gained by the retailer in the transaction. The price of the product can be derived from this value, which we'll do later. It is also highly skewed, but most values range between 0 and 20.
plotDistributionDataset(transaction.loc[:,['WEEK_NO','TRANS_TIME','DAY']])
plt.figure(num = None, figsize = (15,10))
plt.subplot(2,2,1)
ax1 = transaction.SALES_VALUE.hist(bins=1000)
ax1.set_xlim(0,20)
plt.subplot(2,2,2)
ax2 = transaction.QUANTITY.hist(bins=100000)
ax2.set_xlim(0,20)
ax1.title.set_text('SALES_VALUE')
ax2.title.set_text('QUANTITY')
plt.show()
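As referenced above, here is a minimal sketch of the week filtering we intend to apply; the exact cut-off weeks below are placeholders to be chosen by eye from the plots.
# Hypothetical cut-offs: drop the low-volume ramp-up weeks and the final spike
first_week, last_week = 5, 101
transaction_stable = transaction[(transaction.WEEK_NO >= first_week) & (transaction.WEEK_NO <= last_week)]
print(f'{transaction.shape[0] - transaction_stable.shape[0]} transactions would be dropped')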
Comments:
We plot a heatmap of the correlations between the transaction features and observe the following:
corr = transaction.corr(method = 'pearson')
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10)) # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
Comments:
For further use, we derive from the SALES_VALUE feature an additional column in which we sum all the sales values of the transactions made by a household over the whole study. This gives an indicator of the total money spent by the household during the period of the study.
behavior_consumer = pd.DataFrame(transaction.groupby(['household_key']).sum().SALES_VALUE)
behavior_consumer['SALES_VALUE_TOTAL'] = behavior_consumer['SALES_VALUE']
behavior_consumer = behavior_consumer.drop(columns=['SALES_VALUE'])
transaction = pd.merge(transaction, behavior_consumer, on='household_key')
transaction.head()
Comments:
The products subdataset contains the list of products purchased by households during the study. It gives information about each product sold, such as the type of product, whether it carries a national or private label, and a brand identifier. Since the main goal of the project is to analyse people's food consumption, we limit our analysis to food-related products (see list_food below). We plot the distributions of each feature and get the following plots and observations:
DEPARTMENT: Groups similar products together; the majority of products sold were in the Grocery department. It has 10 unique classes.
BRAND: Indicates a private or national label brand; the large majority of products sold came from national brands.
COMMODITY_DESC: Groups similar products together at a lower level. It provides 183 types of commodities (soft drinks, bag snacks, frozen meat, ...).
SUB_COMMODITY_DESC: Groups similar products together at the lowest level. It has 1,319 classes (potato chips, spices and seasoning, beer/malt liquors, ...).
product = pd.read_csv('dataset/product.csv', delimiter=',')
print(f'There are {product.shape[0]} rows and {product.shape[1]} columns')
product.head()
Comments:
We take a look at the distributions of the features Department and Brand since they have a limited number of classes. We notice that most of the products belong to the class 'Grocery', with a value count of nearly 40,000, while the other classes contain fewer than 5,000 products each. We also observe that more than 40,000 products are labeled as 'National' while only 1,000 are labeled as 'Private'.
Since the subdataset contains not only food-related products, we decided to drop some of them based on the department class. We only keep the departments listed below in list_food.
list_food = ['GROCERY','PASTERY','PRODUCE','NUTRITION','MEAT','FROZEN GROCERY','SALAD BAR','SEAFOOD','SPIRITS','SEAFOOD-PCKGD','MEAT-PCKG','DELI']
food_related_products = product[product.DEPARTMENT.apply(lambda x: x in list_food)]
print(f'There are now {food_related_products.shape[0]} rows and {food_related_products.shape[1]} columns')
plt.figure(num = None, figsize = (15,10))
plt.subplot(2,2,1)
ax1 = food_related_products.DEPARTMENT.value_counts().plot.bar()
plt.subplot(2,2,2)
ax2 = food_related_products.BRAND.value_counts().plot.bar()
ax1.title.set_text('DEPARTMENT')
ax2.title.set_text('BRAND')
plt.show()
Comments:
In order to simplify the analysis, we decided to merge the 3 subdatasets. We first drop the unwanted features from each subset, then merge the food-related products and transactions subsets on the product ID. After that, we merge this new intersection dataset with the demographic subset on the household key. Both merges are done as 'inner' joins because we only want to keep the rows common to both subsets.
The new dataset contains all the features. It is household-centric: for each household we have information on the products consumed as well as on the purchases, which makes the analysis simpler.
food_related_products.drop(columns = ['BRAND','CURR_SIZE_OF_PRODUCT','MANUFACTURER'],inplace = True)
consumption = transaction.drop(columns = ['STORE_ID','RETAIL_DISC','TRANS_TIME','COUPON_DISC','COUPON_MATCH_DISC'])
inter = pd.merge(food_related_products,consumption,on = 'PRODUCT_ID')
full_data_set = pd.merge(inter,demographic,on = 'household_key')
behavior_consummer = full_data_set.groupby(['household_key']).sum().SALES_VALUE
full_data_set.head()
Comments:
We take a look at relevant correlations between the features of the new dataset. We first evaluate the correlation between income and the consumed products as well as the total money spent. We start by one-hot encoding the Income feature, then plot a heatmap of the correlations and observe the following:
full_data_set_test = pd.merge(full_data_set,behavior_consummer,on = 'household_key')
one_hot_income = pd.get_dummies(full_data_set_test, columns=["INCOME_DESC"])
corr = one_hot_income.corr(method = 'pearson')
corr.SALES_VALUE_y.sort_values(ascending = False)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10)) # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
Comments:
Now we do the same with the AGE feature, in order to evaluate correlations between age, the consumed products and their sales value. We observe that:
full_data_set_test = pd.merge(full_data_set,behavior_consummer,on = 'household_key')
one_hot_income = pd.get_dummies(full_data_set_test, columns=["AGE_DESC"])
corr = one_hot_income.corr(method = 'pearson')
corr.SALES_VALUE_y.sort_values(ascending = False)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10)) # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
Comments:
From the correlations computed previously, we noticed a slight correlation between the age and the income of a household and its choice of food products, based on their sales value. We want to inspect this correlation further by looking at which foods are common to most income classes and age categories, and which ones are specific to each class. This will help us analyse the disparity of food consumption between social classes, using income and age as social indicators.
Comments:
For the income:
For a product to be classified as a common food, it needs to be consumed by at least half of the income categories. To be classified as a specific food, it needs to be consumed by at most two different income categories.
Results for common foods: from the plot, we can say that every class consumed 6 items (bananas, yogurt, bread, milk, soft drinks, cheese). These six items can be considered commodities necessary for a balanced diet. Furthermore, they are consumed in almost the same proportion across all categories.
Results for specific food products: the results are more significant in this case. Lower-income categories consumed mainly basic products, while higher-income categories consumed more exotic and healthy food such as strawberries, soup, premium food and 100% fruit juice.
ploting_specific_general(full_data_set,'INCOME_DESC')
Comments:
For the age:
For a product to be classified as a common food, it needs to be consumed by at least half of the age categories. To be classified as a specific food, it needs to be consumed by at most two different age categories.
Results for common foods: from the plot, we can say that every class consumed the same 6 items (bananas, yogurt, bread, milk, soft drinks, cheese). These six items can be considered commodities necessary for a balanced diet, and they are consumed in almost the same proportion across all categories. We also have a new item, potato chips, which is mainly consumed by the two youngest age categories: '19-24' and '25-34'.
Results for specific food products: the results are much more interesting than before, as each age category consumes different products. The '45-54' category notably consumes beer; the '50-64' category consumes kids' cereal, suggesting that they have children even though no visible correlation appeared between this age category and kids, and they also consume economy dinners. Finally, the '65+' category is healthier and consumes more healthy food such as 100% fruit juice.
ploting_specific_general(full_data_set,'AGE_DESC')
Comments:
For the marital status, we observe the same pattern as before.
ploting_specific_general(full_data_set,'MARITAL_STATUS_CODE')
Comments:
One thing to verify is whether the different households always buy different items or have habits when buying their food.
From the histogram below, we can see that 63% of products are reordered while 35% are not.
reordered = pd.DataFrame(full_data_set.groupby(['household_key','SUB_COMMODITY_DESC']).count()[['DAY']].DAY.apply(lambda x: True if x>1 else False))
reordered.DAY.value_counts()
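# Expressed as percentages (presumably how the 63% / 35% shares quoted above were obtained):
print(reordered.DAY.value_counts(normalize=True) * 100)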
plt.figure(num = None, figsize = (10,10))
plt.subplot(1,1,1)
ax = reordered.DAY.value_counts().plot.bar()
ax.title.set_text('Reordered products')
plt.show()
The Dunnhumby dataset gives us valuable demographic information, but also information on how people consume and what they consume. With this dataset we were able to evaluate the social status of each household and correlate this information with their food consumption.
Open Food Facts is a collaborative dataset of food products from around the world, with ingredients, allergens, nutrition facts and all the information that can be found on product labels.
The Open Food Facts database is a 2.1 GB CSV file with tab separators. We will use the following features for our analysis:
import math
import pickle
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import MaxNLocator
from wordcloud import WordCloud, STOPWORDS
from scipy import stats
from sklearn import manifold, datasets
from adjustText import adjust_text
def highlight(f, n_labels, size=(10,10)):
x, y = Y.T
names = df_nutrition.product_name[X.index]
names_list = names.tolist()
colors = names.apply(lambda s: 'red' if f(s) else 'lightgray').tolist()
fig, ax = plt.subplots(figsize=size)
sc = plt.scatter(x, y, c=colors, s=5)
texts = [plt.text(x[i], y[i], names_list[i], fontsize=10) for i in np.random.choice(len(X), n_labels)]
adjust_text(texts)
plt.show()
def highlight2(cols, title, size=(10,10)):
x, y = Y.T
def exists_col(r):
for c in cols:
if r[c] == 1:
return True
return False
colors = matrix.loc[X.index.intersection(matrix.index)].apply(lambda r: 'red' if exists_col(r) else 'lightgray', axis=1).tolist()
fig, ax = plt.subplots(figsize=size)
plt.axis('off')
sc = plt.scatter(x, y, c=colors, s=5)
texts = [plt.text(x[i], y[i], names_list[i], fontsize=7) for i in np.random.choice(len(X), 0)]
adjust_text(texts)
plt.title(title)
plt.show()
openfoodfacts = pd.read_csv('dataset/dataset_open_food.csv', sep='\t')
print('Shape of the dataset:',openfoodfacts.shape)
openfoodfacts.head()
nan_column = openfoodfacts.isnull().sum(axis=0).sort_values(ascending=False)
plt.figure(figsize=(10,5))
plt.title('Distributions of NAN over our columns')
nan_column.hist(bins='auto')
plt.xlabel('Number of NaN')
plt.ylabel('Number of columns')
plt.tight_layout()
plt.show()
plt.figure(figsize=(10, 50))
plt.rcParams['axes.facecolor'] = 'white'
plt.rc('grid')
(openfoodfacts.isnull().mean(axis=0)*100).plot.barh()
plt.xlim(xmax=100)
plt.title("Missing values rate")
plt.xlabel("percentage")
plt.show();
Comments:
Furthermore, there are many missing values in the dataset. Some columns are nearly empty, for example the composition columns ("cocoa_100g", "zinc_100g", etc.). This can easily be explained by the fact that it is impossible to put the complete composition on every product's label.
One possibility would be to discard features with more than 60% missing values (for instance), because such features are too incomplete to be reliable.
However, we do not want to do that because in the following parts we will discuss specific aspects of food consumption, and we will only take from this dataset the parts that are of interest (a sketch of such a filter is shown below for reference).
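For reference only, a minimal sketch of such a filter (not applied in our analysis, for the reason given above); the 60% threshold is the illustrative value mentioned earlier.
# Keep only columns with at most 60% missing values (illustrative threshold)
threshold = 0.6
low_nan = openfoodfacts.isnull().mean(axis=0) <= threshold
openfoodfacts_reduced = openfoodfacts.loc[:, low_nan]
print(f'{openfoodfacts.shape[1] - openfoodfacts_reduced.shape[1]} columns would be discarded')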
Comments:
We focus most of our analysis on American products; this is a decision we made in order to complement the Dunnhumby dataset, which covers the US.
Now, we first focus on the nutrition table below; this will help us correct mistakes in the dataset. By comparing energy_100g with the energy calculated from carbohydrates, proteins and fat, we can delete some wrong entries. Likewise, if the sum of these elements in 100 g of product is more than 100 g, the entry is wrong and can be deleted.
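Concretely, assuming the standard energy-conversion factors of roughly 37 kJ per gram of fat and 17 kJ per gram of carbohydrate or protein, the reconstructed energy per 100 g is reconstructed_energy = 37 * fat_100g + 17 * (proteins_100g + carbohydrates_100g) (in kJ), which can then be compared against the reported energy_100g; values above 3700 kJ (the energy of 100 g of pure fat) are physically impossible.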
nutrition_table_cols = ["product_name", "ingredients_text", "allergens","energy_100g",
"fat_100g",
"carbohydrates_100g",
"sugars_100g",
"proteins_100g",
"salt_100g",
"allergens",
"ingredients_from_palm_oil",
"additives","countries_en",
"nutrition_grade_fr","additives_tags"]
df_nutrition = openfoodfacts[nutrition_table_cols].copy()
interest = df_nutrition.product_name.dropna().index.intersection(df_nutrition.ingredients_text.dropna().index)
interest = interest.intersection(df_nutrition[df_nutrition.countries_en == "United States"].index)
df_nutrition = df_nutrition.iloc[interest]
df_nutrition.head()
df_nutrition["sum_elements"] = df_nutrition.fat_100g + df_nutrition.carbohydrates_100g +\
df_nutrition.proteins_100g
df_nutrition["sum_elements"] = round(df_nutrition.sum_elements)
df_nutrition["other_carbs"] = df_nutrition.carbohydrates_100g - df_nutrition.sugars_100g
df_nutrition["reconstructed_energy"] = df_nutrition.fat_100g * 37 + \
(df_nutrition.proteins_100g + df_nutrition.carbohydrates_100g)* 17
df_nutrition.head()
print(df_nutrition.reconstructed_energy.describe())
print('='*40)
print(df_nutrition.energy_100g.describe())
print('='*40)
print(df_nutrition.sum_elements.describe())
Comments:
We notice some obvious outliers: entries whose reconstructed or reported energy is non-positive or exceeds 3700 kJ (the energy of 100 g of pure fat), and entries whose macronutrients sum to more than 100 g or to 0 g.
We remove these wrong entries and can check the result on the plot below:
df_nutrition = df_nutrition.loc[df_nutrition.reconstructed_energy<=3700]
df_nutrition = df_nutrition.loc[df_nutrition.reconstructed_energy>0]
df_nutrition = df_nutrition.loc[df_nutrition.sum_elements<=100]
df_nutrition = df_nutrition.loc[df_nutrition.sum_elements>0]
df_nutrition = df_nutrition.loc[df_nutrition.energy_100g>0]
df_nutrition = df_nutrition.loc[df_nutrition.energy_100g<=3700]
df_nutrition.head()
# Visualization of the errors
plt.figure(figsize=(10,5))
plt.title('Distributions of given and calculated energy of the products')
plt.scatter(df_nutrition["energy_100g"], df_nutrition["reconstructed_energy"])
plt.xlabel('Energy_100 present in the dataset')
plt.ylabel('Calculated energy')
plt.tight_layout()
plt.show()
Comments: The calculated energy is linearly correlated with the given energy; the cleaning process is successful for this part.
Comments:
The challenge is now to parse and extract useful information from the ingredients text field. There seems to be no particular format in the data, except that the ingredients are separated by commas. The first intuition is to take advantage of that: split the ingredients on the comma character, trim them, convert them to lower-case strings and explode the resulting series. This already gives a decent result, but many rows contain additional data such as quantities, which affects our grouping function. It turns out that with some regular expressions we can normalize most of the rows without much work.
# Pipeline to extract the ingredients - can be improved but already works quite well
ingredients = df_nutrition.ingredients_text.apply(lambda s: list(map(lambda t: t.strip(), re.sub(r"\(.+?\)", '', s.lower()).split(','))))
ingredients = ingredients.explode()
exploded = ingredients.str.lower().str.replace(r"^.*?[0-9]+[^ ]*? ", '', regex=True).str.replace(r" [0-9].*$", '', regex=True).str.replace(r"([0-9]*%|\.|[_\(\)\:\*\[\]])", '', regex=True).str.replace(r'^[0-9]+$', '', regex=True).str.strip()
exploded = exploded.str.replace(r' +', ' ', regex=True).str.replace('œ', 'oe', regex=False).str.replace(r'[éèêë]', 'e', regex=True).str.replace(r'[àáâä]', 'a', regex=True).str.replace(r'[îï]', 'i', regex=True).str.replace(r'[úûùü]', 'u', regex=True).str.replace(r'[ôö]', 'o', regex=True)
exploded = exploded[exploded.apply(lambda s: len(s) > 0)]
by_count = pd.Series(exploded).value_counts()
counts = pd.DataFrame()
counts['ingredient'] = list(by_count.index)
counts['apparition'] = list(by_count)
d = {}
counter = 0
for a in counts.ingredient:
d[a] = counts.apparition[counter]
counter = counter + 1
wordcloud = WordCloud(width = 1000, height = 500, background_color="white")
wordcloud.generate_from_frequencies(frequencies=d)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Comments:
With this first analysis, we can already see some obvious patterns emerging: without any surprise the three most common ingredients are salt, sugar and water.
Now we would like to plot our products on a 2D map, grouped into clusters according to their ingredients. This can be done using a dimensionality-reduction algorithm; one of them, t-SNE, can achieve this task fairly efficiently. The idea is to create a matrix of products with 1,000 columns, one for each ingredient in the ranking by number of occurrences, and each cell containing a binary value indicating whether the ingredient is present in the product. Ingredients beyond this ranking are irrelevant because they appear only rarely; moreover, including them would make the algorithm unnecessarily slow.
# Number of ingredients (= dimensions) to consider
n_dimensions = 1000
# Number of rows (= datapoints) to consider at most
n_rows = 100000
# Minimum number of ingredients appearing in the list
min_ingredients = 3
relevant = by_count.head(n_dimensions).index.tolist()
products = exploded.iloc[df_nutrition.head(n_rows).index]
values = {}
for col in relevant:
values[col] = products.apply(lambda s: 1 if s == col else 0).groupby(products.index).max()
matrix = pd.DataFrame(values)
matrix = matrix[matrix.sum(axis=1) >= min_ingredients]
matrix.head(5)
Comments:
The matrix has the shape we want. Note that in the meantime we dropped rows with fewer than 3 ingredients because they would create irrelevant datapoints.
We are now ready to apply the algorithm to the matrix (the execution takes some time).
X = matrix.head(10000)
n_components = 2 # Number of dimensions to reduce to
perplexity = 100 # t-SNE parameter (roughly the number of neighbours considered; higher values are more expensive)
tsne = manifold.TSNE(n_components=n_components, init='random', random_state=0, perplexity=perplexity)
Y = tsne.fit_transform(X)
Comments:
Here we plot some labels to roughly familiarize ourselves with the shape of our map. Here are some possible clusters that the algorithm might have identified:
We'll see later a robust method to confirm this intuition.
highlight(lambda s: False, 70, size=(15,10))
Comments:
In fact, we want to make sure that the clustering is actually meaningful. One way to do that is to show that the reduced dimensions are correlated with an independent variable (i.e. one unused by the algorithm). One such variable is the product name. Here we plot in red the products having 'chocolate' in their name. Clearly the product names are correlated with the reduced dimensions. It also confirms the existence of one of the clusters we hypothesized earlier ("chocolate sweets").
highlight(lambda s: 'chocolate' in s.lower(), 0)
Comments:
Here are some other examples with more keywords:
highlight(lambda s: 'fruit' in s.lower(), 0)
Comments:
Let's highlight the ingredients individually to reason about our previous discoveries. Note that we cannot draw any conclusion from the absolute position of the clusters in these plots, because it is merely the result of a data-representation algorithm. Simply put, we only reordered the points so that products with similar ingredients fall close to each other.
highlight2(['salt'], 'Presence of salt in products')
highlight2(['sugar'], 'Presence of sugar in products')
highlight2(['water'], 'Presence of water in products')
highlight2(['palm oil'], 'Presence of palm oil in products')
# Size of the correlation matrix
n_correlations = 20
cmatrix = np.zeros((n_correlations, n_correlations))
columns = matrix.columns.tolist()[:n_correlations]
# https://en.wikipedia.org/wiki/Phi_coefficient
def phi_corr(n):
top = n[1][1] * n[0][0] - n[1][0] * n[0][1]
bot = math.sqrt((n[1][0] + n[1][1]) * (n[0][0] + n[0][1]) \
* (n[0][0] + n[1][0]) * (n[0][1] + n[1][1]))
return top / bot
for i in range(n_correlations):
a = columns[i]
for j in range(i):
b = columns[j]
m = matrix[[a, b]]
vs = [0, 1]
result = np.zeros((len(vs), len(vs)))
for x in vs:
for y in vs:
result[x][y] = len(m[(m[a] == x) & (m[b] == y)])
cmatrix[i][j] = phi_corr(result)
mask = np.zeros_like(cmatrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
m = np.max(np.abs(cmatrix))
fig, ax = plt.subplots(figsize=(10,10))
ax.set_title('Correlations between the top %s ingredients' % n_correlations)
sns.heatmap(cmatrix, ax=ax, square=True, mask=mask, vmin=-m, vmax=m, xticklabels=columns, yticklabels=columns, cmap='coolwarm');
Comments:
Now we will study some of the most important and trending topics in healthiness. Major health-related magazines usually discuss the following topics: food additives, allergens and palm oil.
These are negative aspects of our food consumption, and we are going to study them with respect to the French nutrition grade (a grade now given in many countries). Specifically, we are going to study world and US consumption concerning these aspects.
In this "additives" part, we are going to discuss the effect of the dangerous food additives on its nutrition grade, and consequently our everyday food consumption.
In fact, the nutrition grade ranges from "a" to "e" (from best to worst). We want to know whether food products that contain dangerous additives, or the most additives, get good or bad nutrition grades.
Additionally, we will compare these scores with the products that do not have any additives.
Hence, in order to determine which additives are bad, we first use the list from the "Hungry for Change" website. This list contains the most dangerous food additives that can be found on the market but are not forbidden; it was compiled from various scientific studies on the matter.
Source: http://www.hungryforchange.tv/article/top-10-food-additives-to-avoid
partb = pd.read_pickle('./dataset/partb.pkl')
additives_pickle = pd.read_pickle('./dataset/additives.pkl')
additives_new = additives_pickle.additives_tags.dropna().map(lambda x : x.lower())
additives_e250 = additives_new.str.contains('e250')
additives_nutrition = additives_pickle.nutrition_grade_fr
additives_both = additives_pickle.iloc[additives_e250.index]
additives_both[additives_both.nutrition_grade_fr.notnull()].sample(5)
from scipy.stats import beta
order = ['a', 'b', 'c','d','e']
def exact_CI(x, N, alpha=0.95):
x = float(x)
N = float(N)
p = round((x/N)*100,2)
intervals = [round(i,4)*100 for i in beta.interval(alpha,x,N-x+1)]
intervals.insert(0,p)
result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}
return intervals
def grade_dist(title, df_dist, switch_CI=False):
if (switch_CI == True):
tmp_df = df_dist.apply(pd.value_counts).loc[order]
tmp_values = tmp_df.values
for i in range(len(tmp_values)):
intervals = exact_CI(tmp_values[i], tmp_df.sum())
print("The interval of confidence for the grade " + str(i) + " is between: " + str(intervals[1]) + " and " +
str(intervals[2]))
fig = df_dist.apply(pd.value_counts).loc[order].plot(kind='bar', subplots=True, color = 'green')
plt.rcParams['axes.facecolor'] = 'white'
plt.title(title)
plt.xlabel("Nutrition Grade")
plt.ylabel("Number of products")
plt.show();
additives_result_e250 = pd.DataFrame(additives_both.nutrition_grade_fr)
grade_dist("Grade Distribution for E250 additive in the World", additives_result_e250, True)
Comment:
Here we find the E250 additive (used for curing or preserving meat and fish products), which can cause acute methemoglobinemia (haemoglobin loses its ability to carry oxygen), irritability, lack of energy, headache, brain damage or even death in severe untreated cases. What is alarming, however, is that many products containing this additive still keep an "a" or "b" nutrition grade.
Now we are going to run the same analysis on the list given by the "Hungry for Change" website, both on the world dataset and on the specific US dataset.
list_dangerous_additives = ['e951','e621', 'e133', 'e124','e110', 'e102', 'e221', 'e320', 'e220']
kwstr = '|'.join(list_dangerous_additives)
mask = additives_new.to_frame().stack().str.contains(kwstr).any(level=0)
df_additives = additives_new[mask]
additives_all = additives_pickle.iloc[df_additives.index]
additives_result_all = pd.DataFrame(additives_all.nutrition_grade_fr)
grade_dist("Grade Distribution for dangerous additives in the World", additives_result_all)
partb_new_null_additives = partb.additives_tags.isnull()
df_additives_null = partb.iloc[partb_new_null_additives.index]
df_result_additives_null = pd.DataFrame(df_additives_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Additives in the World", df_result_additives_null)
Comment:
As we can see from the two plots above, most of the products in the world have a nutrition grade of "d", which is a pretty bad grade, though not the worst.
Naturally, we get fewer products with good grades ("a" and "b") in the grade distribution for products with additives, since products with additives should be more harmful to our health.
partb.countries = partb.countries.str.lower()
# Fix some of the names with multiple entries (on ISO-apha 2 codes and country name)
partb.loc[partb['countries'] == 'en:fr','countries'] = 'france'
partb.loc[partb['countries'] == 'en:es','countries'] = 'spain'
partb.loc[partb['countries'] == 'en:gb','countries'] ='united kingdom'
partb.loc[partb['countries'] == 'en:uk','countries'] ='united kingdom'
partb.loc[partb['countries'] == 'españa','countries'] ='spain'
partb.loc[partb['countries'] == 'us','countries'] = 'united states'
partb.loc[partb['countries'] == 'en:us','countries'] ='united states'
partb.loc[partb['countries'] == 'usa','countries'] = 'united states'
partb.loc[partb['countries'] == 'en:ca','countries'] = 'canada'
partb.loc[partb['countries'] == 'canada','countries'] = 'canada'
partb.loc[partb['countries'] == 'en:au','countries'] = 'australia'
partb.loc[partb['countries'] == 'australia','countries'] = 'australia'
partb.loc[partb['countries'] == 'en:de','countries'] ='germany'
partb.loc[partb['countries'] == 'deutschland','countries'] ='germany'
partb.loc[partb['countries'] == 'en:be','countries'] ='belgium'
partb.loc[partb['countries'] == 'belgium','countries'] ='belgium'
partb.loc[partb['countries'] == 'en:ma','countries'] ='morocco'
partb.loc[partb['countries'] == 'morocco','countries'] ='morocco'
partb.loc[partb['countries'] == 'en:ch','countries'] ='switzerland'
partb.loc[partb['countries'] == 'switzerland','countries'] ='switzerland'
partb.loc[partb['countries'] == 'en:br','countries'] ='brazil'
partb.loc[partb['countries'] == 'brazil','countries'] ='brazil'
partb.loc[partb['countries'] == 'en:mx','countries'] ='mexico'
partb.loc[partb['countries'] == 'mexico','countries'] ='mexico'
partb.loc[partb['countries'] == 'en:ru','countries'] ='russia'
partb.loc[partb['countries'] == 'russia','countries'] ='russia'
partb.loc[partb['countries'] == 'en:dz','countries'] ='algeria'
partb.loc[partb['countries'] == 'algerie','countries'] ='algeria'
partb.loc[partb['countries'] == 'en:za','countries'] ='south africa'
partb.loc[partb['countries'] == 'south africa','countries'] ='south africa'
partb.loc[partb['countries'] == 'en:in','countries'] ='india'
partb.loc[partb['countries'] == 'india','countries'] ='india'
partb.loc[partb['countries'] == 'en:cn','countries'] ='china'
partb.loc[partb['countries'] == 'china','countries'] ='china'
## US plot ##
us = ['united states']
partb_us = partb[partb.countries.isin(us)]
partb_additives_us = partb_us.additives_tags.dropna().map(lambda x : x.lower())
additives_us = partb.iloc[partb_additives_us.index]
result_additives_us = pd.DataFrame(additives_us.nutrition_grade_fr)
grade_dist("Grade Distribution for dangerous additives in the US", result_additives_us)
partb_new_us_null_additives = partb_us.additives_tags.isnull()
df_us_additives_null = partb.iloc[partb_new_us_null_additives.index]
df_result_us_additives_null = pd.DataFrame(df_us_additives_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Additives in the US",df_result_us_additives_null)
Comment:
Regarding the US grade distribution for products with and without additives, we can highlight that the US follows the same trends as the world where additives are concerned.
However, it is worth noticing that the second most frequent grade for products with additives is "e" in the United States; this is important because, intuitively, we expect more products in the right part of this distribution.
Now, we will study the distribution of those additives in many specific countries.
countries = ['france','united kingdom','spain','germany','united states','australia','canada', 'belgium', 'morocco', 'switzerland', 'brazil', 'mexico',
'russia', 'algeria', 'south africa', 'china', 'india']
df_countries = partb[partb.countries.isin(countries)]
df_countries_additives = df_countries[df_countries.additives_n.notnull()]
df_groupedby_countries_additives = df_countries_additives.groupby(['countries']).mean().additives_n.reset_index()
np_countries_additives = np.array(df_groupedby_countries_additives)
np_countries_additives = np_countries_additives[np_countries_additives[:,1].argsort()[::-1]]
# Plot the average number of additives per country
fig = plt.figure(figsize=(23,8))
ax1 = fig.add_subplot(1,1,1)
y_pos = np.arange(len(np_countries_additives[:,0]))
x_pos = np_countries_additives[:,1]
x_ticks = np_countries_additives[:,0]
# Make a barplot
plt.bar(y_pos, x_pos, align='center')
plt.title('Average number of additives per product by country')
plt.xticks(y_pos, x_ticks)
plt.ylabel('Average number of additives')
plt.show()
import plotly.express as px
import country_converter as coco
cc = coco.CountryConverter()
#we create here the ISO-3 country codes in order to plot our map in plotly
df_groupedby_countries_additives['iso_code'] = df_groupedby_countries_additives['countries'].apply(lambda x : coco.convert(names=x, to='ISO3'))
fig = px.choropleth(df_groupedby_countries_additives, locations="iso_code",
color="additives_n", # additives_n is a column of our dataframe
hover_name="countries", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
Comments:
Here, we first ranked the 17 countries from our list in order to identify which countries use the largest average number of additives per product.
Normally, this ranking should reflect the food consumption habits and use of additives of its population.
It can clearly be seen that South Africa, Mexico, Belgium and Australia use the most additives among these countries. This suggests some unhealthy additive habits in these countries, but most importantly a possible lack of food regulation by some of these governments.
However, the countries that use the fewest additives are China and Russia, but we suspect that we have too little data for those countries, which would bias our analysis.
Finally, we plotted a map in order to better visualize the differences between the countries from our study.
def prod_dist(title, feature, xlabel):
plt.figure(figsize=(12,10))
ax = partb[partb[feature] > 0.0][feature].value_counts().sort_index().plot(kind='bar',color='skyblue')
plt.title(title,fontsize=14)
plt.xlabel(xlabel)
plt.ylabel('Number of Products')
plt.show();
prod_dist('Distribution of the number of additives by products','additives_n', 'Additives')
def log_dist(title, feature, xlabel):
plt.figure(figsize=(12,10))
ax = partb[partb[feature] > 0.0][feature].value_counts().sort_index().plot(kind='line',color='skyblue')
plt.title(title,fontsize=14)
plt.xlabel(xlabel)
plt.yscale("log")
plt.ylabel('log-number of products')
plt.show();
log_dist('Log-Distribution of the number of additives by products','additives_n', 'Additives')
Comment:
This first plot shows the distribution of the number of additives per product. It can clearly be seen that most products use 1 to 5 additives in their composition. What is astonishing, however, is that we find products with more than 15 additives in total! Based on our research, the types of additives used are regulated and have to be approved prior to use; however, there is no limit on the number of additives that can be added to a food product.
Furthermore, we plotted the log-distribution of the number of additives per product, and it closely follows a power-law distribution (see the sketch below for a rough check).
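A rough, non-rigorous way to check this power-law claim is to regress the log of the product counts on the log of the number of additives and look at how linear the fit is; the snippet below is only a sketch of such a check.
# Fit a line in log-log space; a power law would give a roughly linear relation
counts_per_n = partb[partb['additives_n'] > 0]['additives_n'].value_counts().sort_index()
log_k = np.log(counts_per_n.index.astype(float))
log_count = np.log(counts_per_n.values)
slope, intercept, r_value, p_value, std_err = stats.linregress(log_k, log_count)
print(f'Estimated power-law exponent: {slope:.2f} (R^2 = {r_value**2:.2f})')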
import matplotlib.patches as mpatches
additives = (additives_pickle['additives_en'].str.extractall("(?P<Count>[E]\d\d\d\w?)"))
additives_count = additives.apply(pd.value_counts).head(20)
additives_count['additives_num'] = additives_count.index
additives_count.reset_index(drop=True,inplace=True)
additives_mapping = {'E330': 'black','E322':'red','E322i':'red','E101':'blue','E375':'black','E101i':'blue',
'E300':'yellow','E415':'red','E412':'black','E500':'black','E471':'red','E203':'green','E407':'red',
'E440':'red','E250':'green','E150a':'blue','E450':'black','E500i':'blue','E331':'black',
'E129':'black','E339':'black','E440i':'red','E160a':'blue','E270':'black','E102':'blue',
'E410':'red','E133':'blue','E341':'black','E428':'red','E621':'black','E202':'blue'}
additives_count['colors'] = additives_count['additives_num'].map(additives_mapping)
ax = additives_count.plot(x='additives_num',y='Count',kind='barh',color=additives_count['colors'],figsize=(15,10))
ax.invert_yaxis()
ax.legend().set_visible(False)
ax.set_title('20 most used additives in food products')
colors = mpatches.Patch(color='blue', label='food colouring')
others = mpatches.Patch(color='black', label='other')
emulsifiers = mpatches.Patch(color='red', label='emulsifiers')
antioxidant = mpatches.Patch(color='yellow', label='antioxidants')
preservatives = mpatches.Patch(color='green', label='food preservatives')
plt.legend(handles=[colors,others,emulsifiers,antioxidant,preservatives])
plt.xlabel('Number of products')
plt.ylabel('types of additives')
plt.show();
Comment:
This is the ranking of the 20 most used additives by number of products, with the classes of additives distinguished by colour.
This list of additives and their classes was taken from the World Health Organization and the FDA. As we can see above, the most dangerous and most popular additive classes in food products are emulsifiers (and "other" additives), present in more than 100,000 products in the dataset.
Sources:
https://www.fda.gov/food/food-additives-petitions/food-additive-status-list
https://www.who.int/en/news-room/fact-sheets/detail/food-additives
Comments:
In this part, we are going to study the impact of presence and absence of allergens on the nutrition grade.
More specifically, this part will also take into consideration the US-special case in order to compare it to the rest of the world.
df_us_allergens = partb_us[['nutrition_grade_fr', 'allergens']]
df_allergens_notnull = partb.allergens.dropna()
print('the Shape of the US Dataframe : ', partb_us.shape)
print('Number of Allergens Descriptions in the whole Dataframe : ', len(df_allergens_notnull.index))
print('Number of Allergens Descriptions in the US Dataframe : ', len(df_us_allergens.allergens.dropna().index))
partb_new_allergens = partb.allergens.dropna().map(lambda x : x.lower())
df_allergens = partb.iloc[partb_new_allergens.index]
df_result_allergens = pd.DataFrame(df_allergens.nutrition_grade_fr)
grade_dist("Grade Distribution for Allergens in the World", df_result_allergens)
partb_new_null_allergens = partb.allergens.isnull()
df_allergens_null = partb.iloc[partb_new_null_allergens.index]
df_result_allergens_null = pd.DataFrame(df_allergens_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Allergens in the World", df_result_allergens_null)
Comments:
From the world dataset, we wanted to compare the grade distributions in regards to the presence and absence of allergens.
Here we do not see large differences between the two plots, but we can notice a slight difference for the grades "b" and "e", which rank higher in frequency than their equivalents in the non-allergens distribution.
We also see that allergens are more represented in the worst grades (similarly to additives).
partb_new_allergens_us = partb_us.allergens.dropna().map(lambda x : x.lower())
df_us_allergens = partb.iloc[partb_new_allergens_us.index]
df_result_us_allergens = pd.DataFrame(df_us_allergens.nutrition_grade_fr)
grade_dist("Grade Distribution for Allergens in the US",df_result_us_allergens)
partb_new_us_null_allergens = partb_us.allergens.isnull()
df_us_allergens_null = partb.iloc[partb_new_us_null_allergens.index]
df_result_us_allergens_null = pd.DataFrame(df_us_allergens_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Allergens in the US",df_result_us_allergens_null)
Comment:
On the other hand, the distribution of products with allergens (compared to those without) on the US market shows a strong imbalance in the nutrition grades: we have roughly the same number of products with grade "b" and grade "d". This imbalance can also be explained by the low number of products carrying an allergen label (845 for the US versus 82,895 for the rest of the world). Hence, we conclude that the impact of allergens on US products cannot be established.
Comments:
Finally, we are going to study the use of palm oil ingredients (or ingredients that may be derived from palm oil) in food products.
palm_list = ['ingredients_from_palm_oil_n',
'ingredients_from_palm_oil',
'ingredients_from_palm_oil_tags',
'ingredients_that_may_be_from_palm_oil_n',
'ingredients_that_may_be_from_palm_oil',
'ingredients_that_may_be_from_palm_oil_tags','nutrition_grade_fr']
df_palm = partb[palm_list]
df_us_palm = partb_us[palm_list]
print('the number of the Palm Oil products in the whole dataframe : ', len(df_palm.ingredients_from_palm_oil_n.dropna().index))
print('the number of the Palm Oil products in the US dataframe : ', len(df_us_palm.ingredients_from_palm_oil_n.dropna().index))
prod_dist('Distribution of the number of Palm Oil Ingredients by products','ingredients_from_palm_oil_n', 'Palm oil Ingredients' )
Comments:
Here we wanted to see the distribution of the number of palm oil ingredients in our products (in the world dataset). Most of the products contain only one palm oil ingredient.
Nonetheless, we do find food products with two or three ingredients containing palm oil, which is undoubtedly bad for one's health.
palm_oil_tags_new = partb.ingredients_from_palm_oil_tags.dropna().map(lambda x : x.lower())
df_palm_ingredients = partb.iloc[palm_oil_tags_new.index]
df_result_palm = pd.DataFrame(df_palm_ingredients.nutrition_grade_fr)
grade_dist("Grade Distribution for Palm Oil Products in the World",df_result_palm )
partb_new_null_palm_oil_tags = partb.ingredients_from_palm_oil_tags.isnull()
df_palm_oil_tags_null = partb.iloc[partb_new_null_palm_oil_tags.index]
df_result_palm_oil_tags_null = pd.DataFrame(df_palm_oil_tags_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Palm Oil Products in the World", df_result_palm_oil_tags_null)
Comments:
Regarding the world dataset, it can clearly be seen that the grade distribution for products with palm oil ingredients is skewed towards the worst grades, and it is expected that products containing palm oil should get the worst nutrition grades.
It also means that palm oil ingredients have a really bad impact on a person's health, and the grade system reflects it.
palm_oil_tags_new_us = partb_us.ingredients_from_palm_oil_tags.dropna().map(lambda x : x.lower())
df_palm_us_ingredients = partb.iloc[palm_oil_tags_new_us.index]
df_result_us_palm = pd.DataFrame(df_palm_us_ingredients.nutrition_grade_fr)
grade_dist("Grade Distribution for Palm Oil Products in the US", df_result_us_palm)
print('The number of products studied in the US Palm Oil Dataframe is only : ', len(df_palm_us_ingredients.index))
#create a column that will be used in order to compute the number of suspected palm oil ingredients in a product
df_countries['palm_oil_n'] = df_countries['ingredients_from_palm_oil_n'] + df_countries['ingredients_that_may_be_from_palm_oil_n']
df_countries_palm = df_countries[df_countries.palm_oil_n.notnull()]
df_groupedby_countries_palm = df_countries_palm.groupby(['countries']).mean().palm_oil_n.reset_index()
np_countries_palm = np.array(df_groupedby_countries_palm)
np_countries_palm = np_countries_palm[np_countries_palm[:,1].argsort()[::-1]]
# Plot the average number of palm oil ingredients per country
fig = plt.figure(figsize=(23,8))
ax1 = fig.add_subplot(1,1,1)
y_pos = np.arange(len(np_countries_palm[:,0]))
x_pos = np_countries_palm[:,1]
x_ticks = np_countries_palm[:,0]
# Make a barplot
plt.bar(y_pos, x_pos, align='center')
plt.title('Average number of Palm oil ingredients per product by country')
plt.xticks(y_pos, x_ticks)
plt.ylabel('Average number of Palm oil ingredients')
plt.show()
import plotly.express as px
import country_converter as coco
cc = coco.CountryConverter()
#we create here the ISO-3 country codes in order to plot our map in plotly
df_groupedby_countries_palm['iso_code'] = df_groupedby_countries_palm['countries'].apply(lambda x : coco.convert(names=x, to='ISO3'))
# we also drop the countries with an average of zero palm oil ingredients so they do not clutter the map
df_groupedby_countries_palm = df_groupedby_countries_palm[df_groupedby_countries_palm['palm_oil_n'] != 0]
fig = px.choropleth(df_groupedby_countries_palm, locations="iso_code",
color="palm_oil_n", # Palm_oil_n is a column of our dataframe
hover_name="countries", # column to add to hover information
color_continuous_scale=px.colors.sequential.algae)
fig.show()
Comments:
To conclude, we wanted to see the distribution of food products containing palm oil ingredients in regard to the nutrition grade.
Hence, with no surprise, we see that the number of products increase gradually with the negative grades (the most number of products have the worst grade which is "e") in the world dataset.
However, we cannot conclude anything the US dataset because we only have 3 products containing in their labels the "palm oil" tag. This can be explained by the fact that many US lobbies want their products to continue using palm oil without informing the public. (source "The Guardian")
Finally, an important result is the ranking of countries whose products are suspected to use palm oil ingredients. The countries that use a lot of palm oil ingredients are Algeria, South Africa and Belgium.
However, it is surprising to see countries such as Australia, Germany and the United States at the bottom of this ranking. This can be explained by the fact that these countries have rather "light" labelling regulations on palm oil, so its presence is often not declared (several of our sources confirm this).
Sources:
https://www.theguardian.com/news/2019/feb/19/palm-oil-ingredient-biscuits-shampoo-environmental
https://www.foodstandards.gov.au/consumer/generalissues/palmoil/Pages/default.aspx
https://mobil.wwf.de/fileadmin/fm-wwf/Publikationen-PDF/WWF_Report_Palm_Oil_-_Searching_for_Alternatives.pdf
https://www.alumniportal-deutschland.org/en/global-goals/sdg-12-consumption/palm-oil-rainforest-indonesia/
https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=101.4
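As an added quick check of the ranking discussed above (np_countries_palm is already sorted in descending order of the average count, so the extremes of the array are the extremes of the ranking):
print('Top of the ranking    :', list(np_countries_palm[:3, 0]))
print('Bottom of the ranking :', list(np_countries_palm[-3:, 0]))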
sns.set(style="white")
#list of correlation elements that we will study
list_additives_palm = ['nutrition_grade_fr', 'additives_n', 'ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n' ]
#correlation matrix function specific to our study
def corr_matrix(df_test, title):
    df_test_new = df_test[list_additives_palm]
    df_test_new = pd.get_dummies(df_test_new, columns=['nutrition_grade_fr'])
    corr = df_test_new.corr()
    #draw the matrix inside the figure we just created (otherwise matshow opens its own, default-sized figure)
    fig = plt.figure(figsize=(20,20))
    plt.matshow(corr, fignum=fig.number)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.colorbar()
    plt.title(title)
    plt.show()
corr_matrix(partb, 'Nutrition Grade Correlation Matrix on the World');
corr_matrix(partb_us, 'Nutrition Grade Correlation Matrix on the US');
Comments:
Here we wanted to plot the correlation matrices related to our findings in the 3 previous parts: we only find a slight correlation between the number of additives and the palm oil ingredient counts (columns 0 and 2 of the matrix).
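As a quick numeric complement to the visual matrices (an added check, not part of the original pipeline), the pairwise coefficients can also be printed directly:
# Pearson correlations between the additive count and the two palm oil ingredient counts,
# computed on the world dataframe partb (same columns as in list_additives_palm)
print(partb[['additives_n', 'ingredients_from_palm_oil_n',
             'ingredients_that_may_be_from_palm_oil_n']].corr())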
We proceeded to cleaning each of the two datasets, then analyzed most of their features as well as their correlations. Now the two datasets are ready to be joined. The idea is to complete the Dunnhumby dataset by adding, for the products purchased by households, information about their ingredients and nutritive value, in order to evaluate the healthiness of each food product. Dunnhumby classifies food products at different levels: Sub-commodity, Commodity and Department, while OpenFoodFacts has a Product name attribute and a Category feature. Comparing these features, we decided to merge the two datasets based on the sub-commodity attribute for Dunnhumby and the product name for OpenFoodFacts, since these have the closest product classification level. However, the labels are not exactly the same between Sub-commodity and Product name, so we need a function that finds the product in OpenFoodFacts with the closest name to a sub-commodity in Dunnhumby. For this we normalized the product names in both datasets by applying the same methodology:
First, we applied case folding (everything lower-case) and tokenization; this helped reduce the sparsity between the datasets.
Then, we applied nltk's English stopword removal, and to boost the quality of the matching we went further and created a list of words that are useless for our task (brand names, colors, etc.).
Furthermore, we lemmatized every word using the WordNetLemmatizer from the nltk library.
Finally, we used the Difflib class SequenceMatcher, which compares two strings and provides a ratio indicating the 'closeness' of the two names. Using a threshold method, we keep the pairs found with a ratio above 0.7. We also add two rules in order to give priority to exact matches, and to product names included in the sub-commodity name or vice-versa; a small illustration of the ratio criterion is given right below.
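As a small illustration of this criterion (the two names below are made-up examples, not actual entries from either dataset):
from difflib import SequenceMatcher
# ratio() returns 2*M/T, where M is the number of matched characters and T the combined length
ratio = SequenceMatcher(None, 'white fluid milk', 'fluid milk').ratio()
print(round(ratio, 2), '-> kept' if ratio >= 0.7 else '-> discarded')   # ~0.77, above the 0.7 threshold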
We constructed a dictionary of product name matches between Dunnhumby (keys) and OpenFoodFacts (values). We can notice that we have a lot of exact matches, and a good number of close ones. We are, however, unable to build a complete dictionary that assigns every product present in Dunnhumby to a product present in OpenFoodFacts: only about one third of the products listed in Dunnhumby have a match with a product in OpenFoodFacts.
We decided to keep the resulting dictionary for this report, and to extract some meaningful information from it.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import seaborn as sns
from sklearn import manifold, datasets
from adjustText import adjust_text
from difflib import SequenceMatcher
import gensim
from gensim.utils import *
from difflib import SequenceMatcher
import pandas as pd
import numpy as np
import csv
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(txt):
#Apply case folding
txt = txt.lower()
#Apply tokenization
txt = simple_preprocess(txt)
#Lemmatization
lemmatized_words = [lemmatizer.lemmatize(word) for word in txt]
#Stopword removal
stop_words = list(set(stopwords.words('english')))
other_stop_words = ['natural', 'original', 'organic', 'except', 'variety', 'thai', 'heaven', 'divine', 'heavenly', 'divinely', 'refined', 'food', 'quick',
'style', 'store', 'best', 'company', 'magic', 'gourmet', 'guiltless', 'other', 'super', 'prem', 'mini', 'dark', 'aunt', 'tricolor', 'freshly', 'williams', 'lundberg',
'golden', 'olde', 'warner', 'world', 'hudson', 'ryan', 'emojeez', 'merry', 'christmas', 'yam', 'yan', 'grove', 'confectionery', 'co', 'halloween', 'st', 'pat',
'deluxe', 'premium', 'gold', 'best', 'petite', 'lindor', 'family', 'kid', 'sprungli', 'passion', 'delicious', 'hello', 'panda', 'special', 'large', 'party',
'mix', 'assorted', 'round', 'creation', 'covered', 'bite', 'lindt', 'cooked', 'island', 'icicle', 'fine', 'rocher', 'four' , 'psst', 'homestyle', 'real', 'finely',
'monterey', 'jack', 'imitation', 'vit', 'part', 'artic', 'blaster', 'sammies', 'havarti', 'zesty', 'traditional', 'old', 'fashioned', 'microwave', 'kettle', 'movie',
'theater', 'classic', 'add', 'piece', 'peeled', 'segment', 'truth', 'simple', 'product', 'harvest', 'handpicked', 'added', 'paul', 'mr',
'fisherman', 'alfredo', 'fettuccini', 'cellas', 'arborio' , 'naan', 'dahi', 'luxury', 'dreamhouse','harvest','calbee', 'acidophilus',
'probiotic','triple','cleaned', 'instant', 'homestyle', 'stadium', 'hawaiian', 'koepplinger', 'classic','woodfield', 'bunker','hill','mountain',
'frutel' ,'evaporated', 'added','opaa','conchita', 'squeeze' ,'cha','act','ii','kln','brand','happy','valentine','day','sampler','vitarroz',
'tam','fun','tillamookies','richardson','mr','florentine','yamamotoyama', 'loretta','joseph','flax','ala','wan','ja','wanjashan','wismettac','free',
'range','aa','froyo','garcia','boxed', 'baby', 'adult', 'child', 'assortment']
stop_words.extend(other_stop_words)
filtered_words = [word for word in lemmatized_words if word.lower() not in stop_words]
#Remove redundancy in the final product name
unique_words = list(set(filtered_words))
txt = ' '.join(unique_words)
return txt
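A quick sanity check of the pipeline on a made-up sub-commodity label (the exact word order varies between runs because of the set() used to remove duplicates, and nltk's stopwords and wordnet corpora must have been downloaded beforehand):
# nltk.download('stopwords'); nltk.download('wordnet')   # one-time downloads if missing
print(lemmatize_text('FLUID MILK WHITE ONLY'))   # e.g. 'milk fluid white' (order may differ)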
def most_frequent(List):
counter = 0
num = List[0]
for i in List:
curr_frequency = List.count(i)
if(curr_frequency> counter):
counter = curr_frequency
num = i
return num
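For reference, the same result can be obtained more idiomatically with collections.Counter (kept here only as a sketch; the helper above is what the rest of the notebook uses):
from collections import Counter
def most_frequent_counter(lst):
    # most_common(1) returns the (value, count) pair with the highest count;
    # ties go to the value seen first, like the loop above
    return Counter(lst).most_common(1)[0][0]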
def age_function(x):
n = len(demographic.AGE_DESC.value_counts())
values = demographic.AGE_DESC.value_counts().sort_index(ascending=True).index
for i in range(3):
for j in range(int(n/3)):
if(i == 0 and x == values[j]):
return '19-34'
elif(i == 1 and x == values[j + int(n/3)]):
return '35-54'
elif(i == 2 and x == values[j + 2*int(n/3)]):
return '55-65+'
def income_function(x):
n = len(demographic.INCOME_DESC.value_counts())
values = demographic.INCOME_DESC.value_counts().sort_index(ascending=True).index
for i in range(4):
for j in range(int(n/4)):
if(i == 0 and x == values[j]):
return '-15-34K'
elif(i == 1 and x == values[j + int(n/4)]):
return '035-99K'
elif(i == 2 and x == values[j + 2*int(n/4)]):
return '100-174K'
elif(i == 3 and x == values[j + 3*int(n/4)]):
return '175-250K+'
def grade_food(x):
a = x.count('a')/len(x)
b = x.count('b')/len(x)
c = x.count('c')/len(x)
d = x.count('d')/len(x)
e = x.count('e')/len(x)
counts = [a, b, c, d, e]
labels = ['a','b','c','d','e']
best = max(counts)
return labels[counts.index(best)]
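A tiny sanity check of grade_food on a hand-made list of grades:
print(grade_food(['a', 'd', 'd', 'd', 'e']))   # 'd' has the highest share (0.6), so 'd' is returned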
Handling text for OpenFoodFacts:
openfoodfacts = pd.read_csv('dataset/OpenFoodFacts_processed.csv')
#Drop all the products in OpenFoodFacts with no nutrition grades
openfoodfacts = openfoodfacts.dropna(subset = ['nutrition_grade_fr'])
#Apply our text processing pipeline
openfoodfacts['openfoodfacts_comun_name'] = openfoodfacts.product_name.apply(lemmatize_text)
#In order to remove some redundancy in our dataframe, for each product name which is not unique we keep only the most
#frequent nutrition grade and the median of all macronutrients
nt_grade = openfoodfacts.groupby(['openfoodfacts_comun_name']).nutrition_grade_fr.apply(list).reset_index().rename(columns={'openfoodfacts_comun_name':'openfoodfacts_comun_name'})
nt_grade.nutrition_grade_fr = nt_grade.nutrition_grade_fr.apply(most_frequent)
openfoodfacts = openfoodfacts.drop(columns = ['nutrition_grade_fr'])
openfoodfacts_dict = pd.merge(nt_grade,openfoodfacts,on = 'openfoodfacts_comun_name')
macro = openfoodfacts_dict.groupby(['openfoodfacts_comun_name'])[['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g']].median().reset_index()
openfoodfacts_dict = openfoodfacts_dict.drop(columns = ['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g'])
openfoodfacts_dict = pd.merge(macro,openfoodfacts_dict,on = 'openfoodfacts_comun_name')
openfoodfacts_dict = openfoodfacts_dict.drop_duplicates(subset=['openfoodfacts_comun_name'], keep='last')
openfoodfacts_dict = openfoodfacts_dict.drop(openfoodfacts_dict[openfoodfacts_dict.openfoodfacts_comun_name.apply(lambda x: x=='')].index)
#Drop all the columns that are not interesting for our analysis
openfoodfacts_dict = openfoodfacts_dict.drop(columns = ['allergens','allergens.1', 'ingredients_from_palm_oil', 'additives','countries_en', 'additives_tags', 'sum_elements', 'other_carbs',
'reconstructed_energy'])
Handling text for Dunnhumby:
demographic = pd.read_csv('dataset/hh_demographic.csv', delimiter=',')
demographic.INCOME_DESC = demographic.INCOME_DESC.apply(correct_indecome)
demographic.MARITAL_STATUS_CODE = demographic.MARITAL_STATUS_CODE.apply(correct_marital_status)
demographic.MARITAL_STATUS_CODE = demographic.apply(lambda row: replace_unknowns(row['HH_COMP_DESC'],
row['MARITAL_STATUS_CODE'], row['HOUSEHOLD_SIZE_DESC']),axis=1)
demographic['categorical_age'] = demographic.AGE_DESC.apply(age_function)
demographic['categorical_income'] = demographic.INCOME_DESC.apply(income_function)
transaction = pd.read_csv('dataset/transaction_data.csv', delimiter=',')
demographic_households = demographic.household_key.unique()
transaction = transaction[transaction.household_key.apply(lambda x: x in demographic_households)]
product = pd.read_csv('dataset/product.csv', delimiter=',')
list_food = ['GROCERY','PASTERY','PRODUCE','NUTRITION','MEAT','FROZEN GROCERY','SALAD BAR','SEAFOOD','SPIRITS','SEAFOOD-PCKGD','MEAT-PCKG','DELI']
food_related_products = product[product.DEPARTMENT.apply(lambda x: x in list_food)]
food_related_products.drop(columns = ['BRAND','CURR_SIZE_OF_PRODUCT','MANUFACTURER'],inplace = True)
consumution = transaction.drop(columns = ['STORE_ID','RETAIL_DISC','TRANS_TIME','COUPON_DISC','COUPON_MATCH_DISC','BASKET_ID','DAY','WEEK_NO'])
#Apply our text processing pipeline on the product dataframe (for efficiency)
food_related_products['dunhumby_comun_name'] = food_related_products.SUB_COMMODITY_DESC.apply(lemmatize_text)
food_related_products = food_related_products.drop(food_related_products[food_related_products.dunhumby_comun_name.apply(lambda x: x=='')].index)
food_related_products = food_related_products.drop_duplicates(subset=['dunhumby_comun_name'], keep='last')
#Join the product dataframe with the demographic and transaction dataframes
inter = pd.merge(food_related_products,consumution,on = 'PRODUCT_ID')
full_data_set = pd.merge(inter,demographic,on = 'household_key')
#Build our dictionary of product name matches between the two datasets
flag = False
dic = {}
for i in food_related_products.dunhumby_comun_name:
print(i)
d1 = {}
d2 = {}
flag = False
flag_in = False
for j in openfoodfacts_dict.openfoodfacts_comun_name:
#Condition if both words are the same
if i == j:
dic[i] = j
flag = False
flag_in = False
break
#Condition if one word is contained in the other word
elif (i in j) or (j in i):
s = SequenceMatcher(None, i, j)
d2[j] = s.ratio()
flag_in = True
else:
s = SequenceMatcher(None, i, j)
d1[j] = s.ratio()
flag = True
#Extract the best score from the previous conditions
if flag_in:
m2 = max(d2.values())
key_list2 = list(d2.keys())
val_list2 = list(d2.values())
if flag:
m1 = max(d1.values())
key_list1 = list(d1.keys())
val_list1 = list(d1.values())
if m2>=m1 and m2>=0.7:
print('dunhumby : ', i,', openfoodfacts : ', key_list2[val_list2.index(m2)],', ratio :', m2, ', in')
dic[i] = key_list2[val_list2.index(m2)]
else:
if m1 >= 0.75:
print('dunhumby : ', i,', openfoodfacts : ', key_list1[val_list1.index(m1)],', ratio :', m1)
dic[i] = key_list1[val_list1.index(m1)]
else:
print('dunhumby : ', i,', openfoodfacts : ', key_list2[val_list2.index(m2)],', ratio :', m2, ', in')
dic[i] = key_list2[val_list2.index(m2)]
elif flag:
m = max(d1.values())
key_list = list(d1.keys())
val_list = list(d1.values())
if m>=0.75:
print('dunhumby : ', i,', openfoodfacts : ', key_list[val_list.index(m)],', ratio :', m)
dic[i] = key_list[val_list.index(m)]
#save the dictionary to disk so it does not have to be recomputed (csv.writer itself has no close method)
with open("merge.csv", "w", newline='') as f:
    w = csv.writer(f)
    for key, val in dic.items():
        w.writerow([key, val])
#The dictionary is created only once and then reloaded from disk, since the word order produced by the preprocessing (set()) varies between runs
dictionaire = pd.read_csv('dataset/merge.csv',header=None)
openfoodfacts_dict = pd.read_csv('dataset/dictionaire.csv',header=None)
dictionaire.columns = ['dunhumby_comun_name','openfoodfacts_comun_name']
inter_medaire = pd.read_csv('dataset/inter_medaire.csv')
grades = pd.read_csv('dataset/grades.csv')
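The two csv files above were produced offline; what follows is only a minimal sketch of how they could be rebuilt from the objects defined earlier (assuming openfoodfacts_dict still holds the processed OpenFoodFacts table with named columns), not necessarily the exact code that generated the saved files:
# map each purchased product to its OpenFoodFacts counterpart through the name dictionary,
# then attach the nutrition grade and macronutrient columns to every transaction
inter_sketch = pd.merge(full_data_set, dictionaire, on='dunhumby_comun_name')
grades_sketch = pd.merge(inter_sketch, openfoodfacts_dict, on='openfoodfacts_comun_name')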
Comments:
From the plot, we can conclude that our dictionary has kept products that are consumed in daily life, without significant changes in the order of the most consumed products.
import matplotlib.pyplot as plt
plt.figure(num = None, figsize = (10,10))
ax = grades.dunhumby_comun_name.value_counts()[:10].plot.bar()
ax.title.set_text('Most frequent product')
plt.show()
Comments:
An important step before any analysis is to check the number of products present in the dataset for each nutrition grade (to look for balanced or unbalanced categories). We can clearly observe that the dataset is slightly unbalanced in favour of nutrition grade 'd'.
plt.figure(num = None, figsize = (10,10))
ax = inter_medaire.nutrition_grade_fr.value_counts().sort_index().plot.bar()
ax.title.set_text('Nutrition grade distribution')
plt.show()
Comments:
First, let us start with the impact of the income on the nutrition grade. Unsurprisingly, the results are in accordance with our hypothesis: households with an income between 15K and 50K dollars tend to buy food products with a low nutrition grade. As the income grows higher, the nutrition grade of the consumed foods improves. For incomes between 175K and 250K dollars, we can see that the most frequent nutrition grade of consumed food products is very high. This shows that income has a direct relationship with how healthily a household eats: wealthy households seem to eat a lot healthier than others with lower income. However, we can observe that for the very richest ones the nutrition grade drops dramatically. Let us now examine these observations in detail. Considering the nutrition grade only is not completely representative of these influences, so we complete it by looking at the food composition.
#Compute the most frequent nutrition grade consumed by each income category
categorical_income_nutrition = grades.groupby(['INCOME_DESC']).nutrition_grade_fr.apply(list).reset_index().rename(columns={'categorical_income':'categorical_income'})
categorical_income_nutrition.nutrition_grade_fr = categorical_income_nutrition.nutrition_grade_fr.apply(grade_food)
categorical_income_nutrition
categorical_income_macro = grades.groupby(['INCOME_DESC','household_key'])[['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g']].median().reset_index()
categorical_income_macro = categorical_income_macro.groupby(['INCOME_DESC'])[['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g']].mean().reset_index()
categorical_income_macro
Comments:
Let's analyse the distribution of macronutrients for each income category.
Low-income households seem to consume more fat and proteins than others; they also tend to pick highly energetic products in general. Wealthy households, in contrast, have a much more balanced diet, with roughly equal, moderate quantities of fat, carbohydrates and proteins, and choose products with a lower energy value.
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
categorical_income_macro_scaled = pd.DataFrame(scaler.fit_transform(categorical_income_macro[['energy_100g','fat_100g',
'carbohydrates_100g','proteins_100g']]),columns = ['energy_100g','fat_100g',
'carbohydrates_100g','proteins_100g'])
categories=list(categorical_income_macro_scaled.columns)
N = len(categories)
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1]
ax = plt.subplot(111, polar=True,autoscale_on = True)
ax.set_theta_offset(0)
ax.set_theta_direction(-1)
plt.xticks(angles[:-1], ['Ratio energy','Ratio fat','Ratio carbohydrates','Ratio proteins'])
ax.set_rlabel_position(0)
plt.yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0], ['0.1','0.2','0.3','0.4','0.5','0.6','0.7','0.8','0.9','1.0'], color="grey", size=7)
plt.ylim(0,1)
labels = categorical_income_macro.INCOME_DESC.values
# Ind1
for i in range(12):
values= categorical_income_macro_scaled.iloc[i][:].values.flatten().tolist()
values += values[:1]
ax.plot(angles, values, linewidth=1, linestyle='solid', label= labels[i])
ax.fill(angles, values, 'b', alpha=0.1)
# Add legend
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
Comments:
We then compared the influence of age on the nutrition grade. Surprisingly, the results are mixed this time: there is no real trend. The different age categories mostly consume products with a nutrition grade of 'd', except the 65+ and 45-54 categories. Considering the nutrition grade only is not completely representative of these influences, so we complete it by looking at the food composition.
#Compute the most frequent nutrition grade consumed by each age category
categorical_age_nutrition = grades.groupby(['AGE_DESC']).nutrition_grade_fr.apply(list).reset_index().rename(columns={'categorical_income':'categorical_income'})
categorical_age_nutrition.nutrition_grade_fr = categorical_age_nutrition.nutrition_grade_fr.apply(grade_food)
categorical_age_nutrition
Comments:
If we do the same for the age categories, we see that young households tend to consume more fats and sugars (carbohydrates) as well as highly energetic products. Middle-aged households seem to have a well-balanced nutrition, with low fat and low sugar, and the food products they choose have a low energetic value. Seniors, on their side, consume very little sugar, but have a diet with a lot of proteins and fat, and the products they buy are very energetic.
categorical_age_macro = grades.groupby(['AGE_DESC','household_key'])[['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g']].median().reset_index()
categorical_age_macro = categorical_age_macro.groupby(['AGE_DESC'])[['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g']].mean().reset_index()
categorical_age_macro
categorical_age_macro_scaled = pd.DataFrame(scaler.fit_transform(categorical_age_macro[['energy_100g','fat_100g',
'carbohydrates_100g','proteins_100g']]),columns = ['energy_100g','fat_100g',
'carbohydrates_100g','proteins_100g'])
categories=list(categorical_age_macro_scaled.columns)
N = len(categories)
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1]
ax = plt.subplot(111, polar=True,autoscale_on = True)
ax.set_theta_offset(0)
ax.set_theta_direction(-1)
plt.xticks(angles[:-1], ['Ratio energy','Ratio fat','Ratio carbohydrates','Ratio proteins'])
ax.set_rlabel_position(0)
plt.yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0], ['0.1','0.2','0.3','0.4','0.5','0.6','0.7','0.8','0.9','1.0'], color="grey", size=7)
plt.ylim(0,1)
labels = categorical_age_macro.AGE_DESC.values
for i in range(6):
values=categorical_age_macro_scaled.iloc[i][:].values.flatten().tolist()
values += values[:1]
ax.plot(angles, values, linewidth=1, linestyle='solid', label=labels[i])
ax.fill(angles, values, 'b', alpha=0.1)
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
Comments:
As we expected, we could observe that there is a direct link between social status and the healthiness of the consumed food. But we would like to know whether these observations can be statistically confirmed, so we take a look at the correlations between income, age, food prices and nutrition grade.
We see that there are some slight correlations between high incomes and high nutrition grades; middle incomes seem to be slightly correlated with middle nutrition grades, while low incomes are correlated with low nutrition grades.
For age, the correlation with the nutrition grade seems to follow a roughly linear trend: young age categories are slightly correlated with low nutrition grades, and the older the households are, the higher the correlation with a better nutrition grade.
one_hot_income = pd.get_dummies(grades, columns=["AGE_DESC",'INCOME_DESC','nutrition_grade_fr'])
corr = one_hot_income.corr(method = 'pearson')
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10)) # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
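To read off the specific values behind the comments above, we can extract the income-vs-grade sub-block of the matrix (the column names follow pandas' default get_dummies prefixes):
income_cols = [c for c in corr.columns if c.startswith('INCOME_DESC_')]
grade_cols = [c for c in corr.columns if c.startswith('nutrition_grade_fr_')]
print(corr.loc[income_cols, grade_cols].round(2))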
Food consumption is at the heart of many debates: the human race has always centered its attention on this basic need, which defines us.
During this study, we finally established a connection between social status and food consumption: there is now a clear common trend between wealthiness and healthiness.
This trend also highlights the fact that cultures from around the world, and especially the United States, share common food consumption trends with regard to this perpetual need for health, which means fewer additives, less palm oil... less excess, to sum it up!
For further work, it would be great to have access to a database similar to Dunnhumby but covering more retailers from around the world. Naturally, we would also like more data in the OpenFoodFacts database in order to apply some of our previous analysis techniques to the many countries previously ignored because of a lack of data.
Most importantly, some fresh ideas could be applied to this study, by finding a more "social" database that could identify patterns between any social network (friends, family, political orientation, religion, etc.) and our daily food consumption, in order to have a more comprehensive and broad point of view!